submitted to search engines to represent the information needs of users. The proposed feedback
sessions are clustered, and data are bound to each cluster by means of a membership function. Feedback
sessions are constructed from user click-through logs and can efficiently reflect the information
needs of users. Pseudo-documents are generated to better understand the clustered feedback sessions. The fuzzy
c-means clustering algorithm is used to cluster the feedback sessions, which effectively
reflects user needs. The fuzzy c-means algorithm uses the reciprocal of distances to decide the cluster
centers. A ranking model is used to rank the URLs based on the user search
feedback. Performance is evaluated using "Classified Average Precision (CAP)" for user search
results.

Keywords: Fuzzy c-means algorithm, membership function, feedback sessions, pseudo-documents

I. Introduction

This paper presents a novel approach to restructuring user search results using feedback sessions. First, we
cluster the feedback sessions using the fuzzy c-means algorithm. Second, we use a novel optimization method to
map feedback sessions to pseudo-documents, which can efficiently reflect user information needs. Third, we
evaluate the CAP of the restructured web search results. Generally, data mining (sometimes called data or
knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful
information, that is, information that can be used to increase revenue, cut costs, or both. Data mining software
is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different
dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the
process of finding correlations or patterns among dozens of fields in large relational databases.

Data mining is the process of choosing, discovering, and exhibiting huge volumes of data to 
determine unknown patterns or associations useful to the data analyst. The objectives of data mining can 
be classified into two tasks: description and prediction. While the purpose of description is to mine 
understandable forms and relations from data, the goal of prediction is to forecast one or more variables of 
interest. 

Clustering is the most important concept used here. Clustering analyzes data objects without consulting
a known class label. The objects are grouped or clustered based on the principle of maximizing the intra-class
similarity and minimizing the inter-class similarity. The Apriori algorithm, an association-rule mining
method, is used to find the frequently used URLs.

II. Feedback Session 

The proposed feedback session consists of both clicked and unclicked URLs and ends with the last
URL that was clicked in a single session. The motivation is that, before the last click, all the URLs have been
scanned and evaluated by the user. The clicked URLs tell what users require, and the unclicked URLs reflect what
users do not care about. It is more efficient to analyze the feedback sessions than to analyze the search results or
clicked URLs directly.
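
The construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_feedback_session` and the representation of a log entry as a ranked URL list plus a set of clicked URLs are assumptions made for the example.

```python
def build_feedback_session(results, clicked):
    """Split the scanned part of a result list into clicked and unclicked URLs.

    results: URLs in the order they were shown (hypothetical log format).
    clicked: set of URLs the user clicked in this session.
    The session ends at the last clicked URL; later URLs are ignored.
    """
    last = max((i for i, url in enumerate(results) if url in clicked), default=-1)
    if last < 0:
        return [], []  # no click, so no feedback session can be formed
    scanned = results[: last + 1]
    clicked_urls = [u for u in scanned if u in clicked]
    unclicked_urls = [u for u in scanned if u not in clicked]
    return clicked_urls, unclicked_urls


results = ["a.com", "b.com", "c.com", "d.com", "e.com"]
session = build_feedback_session(results, clicked={"b.com", "d.com"})
print(session)
```

Here `e.com` is excluded because it appears after the last click and may never have been scanned by the user.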

First, we extract the titles and snippets of the returned URLs appearing in the feedback session.
Each URL in a feedback session is represented by a small text paragraph that consists of its title and
snippet. Each URL's title and snippet are represented by a Term Frequency-Inverse Document Frequency
(TF-IDF) vector, respectively, as in

$$T_{u_i} = [t_{w_1}, t_{w_2}, \ldots, t_{w_n}]^T, \qquad S_{u_i} = [s_{w_1}, s_{w_2}, \ldots, s_{w_n}]^T,$$

where $T_{u_i}$ and $S_{u_i}$ are the TF-IDF vectors of the URL's title and snippet, respectively, $u_i$ means the $i$th URL in the
feedback session, $w_j$ ($j = 1, 2, \ldots, n$) is the $j$th term appearing in the enriched URLs, and $t_{w_j}$ and $s_{w_j}$ represent the
TF-IDF value of the $j$th term in the URL's title and snippet, respectively.
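
As a rough illustration of these vectors, the sketch below computes TF-IDF values for titles and snippets over a shared vocabulary. It assumes whitespace tokenization, relative term frequency, and an IDF of log(N / df) computed over all titles and snippets together; the paper does not specify these details, and the helper names `build_idf` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter


def build_idf(texts):
    """IDF per term, log(N / df), over a list of short texts (assumed weighting)."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    df = Counter(w for doc in tokenized for w in set(doc))
    return {w: math.log(n / df[w]) for w in df}


def tfidf_vector(text, vocab, idf):
    """TF-IDF vector of one text over the shared vocabulary w_1..w_n."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total * idf[w] for w in vocab]


titles = ["sun java download", "the sun newspaper", "sun weather forecast"]
snippets = [
    "download java from sun",
    "daily news from the sun newspaper",
    "sun and rain forecast for today",
]

idf = build_idf(titles + snippets)
vocab = sorted(idf)                                       # the terms w_j
T = [tfidf_vector(t, vocab, idf) for t in titles]         # title vectors T_ui
S = [tfidf_vector(s, vocab, idf) for s in snippets]       # snippet vectors S_ui
```

A term that occurs in every text (here "sun") receives an IDF of zero, so it contributes nothing to either vector.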

The distributions of different user search goals can be obtained conveniently after feedback
sessions are clustered. A novel optimization method is used to combine the enriched URLs in a feedback
session to form a pseudo-document, which can effectively reflect the information need of a user. User
search goals are then inferred by clustering the proposed feedback sessions, which can effectively reflect
user needs.



III. Forming Pseudo Document 

We propose an optimization method to combine both clicked and unclicked URLs in the feedback
session. Let $F_{fs}$ be the feature representation of a feedback session and $f_{fs}(w)$ be the value for the term $w$. Let
$F_{uc_m}$ ($m = 1, 2, \ldots, M$) and $F_{\overline{uc}_l}$ ($l = 1, 2, \ldots, L$) be the feature representations of the clicked and unclicked URLs in this
feedback session, respectively. Let $f_{uc_m}(w)$ and $f_{\overline{uc}_l}(w)$ be the values for the term $w$ in these vectors. We want to
obtain an $F_{fs}$ such that the sum of the distances between $F_{fs}$ and each $F_{uc_m}$ is minimized and the sum of the
distances between $F_{fs}$ and each $F_{\overline{uc}_l}$ is maximized:

$$F_{fs} = [f_{fs}(w_1), f_{fs}(w_2), \ldots, f_{fs}(w_n)]^T,$$

$$f_{fs}(w) = \arg\min_{f_{fs}(w)} \ldots$$
Let $I_c$ be the interval $[\mu_{f_{uc}}(w) - \sigma_{f_{uc}}(w),\ \mu_{f_{uc}}(w) + \sigma_{f_{uc}}(w)]$ and $I_{\overline{uc}}$ be the interval
$[\mu_{f_{\overline{uc}}}(w) - \sigma_{f_{\overline{uc}}}(w),\ \mu_{f_{\overline{uc}}}(w) + \sigma_{f_{\overline{uc}}}(w)]$, where $\mu_{f_{uc}}(w)$ and $\sigma_{f_{uc}}(w)$ represent the mean and mean square error of
$f_{uc_m}(w)$, respectively, and $\mu_{f_{\overline{uc}}}(w)$ and $\sigma_{f_{\overline{uc}}}(w)$ represent the mean and mean square error of
$f_{\overline{uc}_l}(w)$, respectively. Even if people skip some unclicked URLs because of duplication. Each dimension
of $F_{fs}$ indicates the importance of a term in this feedback session. $F_{fs}$ is the pseudo-document that we
want to introduce. It reflects what users desire and what they do not care about. It can be used to
approximate the goal texts in the user's mind.
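
The idea of pulling $F_{fs}$ toward clicked URLs and pushing it away from unclicked ones can be sketched per term as below. This is only an illustration under stated assumptions, not the paper's exact objective: it assumes squared differences as the distance, a hypothetical trade-off weight `lam` on the unclicked term, the population standard deviation as the "mean square error", and a solution restricted to the interval $I_c$.

```python
from statistics import mean, pstdev


def pseudo_document(clicked, unclicked, lam=0.5):
    """Per-term value f_fs(w) minimizing
         sum_m (f - f_ucm(w))^2 - lam * sum_l (f - f_ucl(w))^2
    over the interval I_c = [mean - std, mean + std] of the clicked values.
    `lam` is an assumed weight, not a value taken from the paper.
    """
    n_terms = len(clicked[0])
    M, L = len(clicked), len(unclicked)
    f_fs = []
    for j in range(n_terms):
        c = [v[j] for v in clicked]
        u = [v[j] for v in unclicked]
        lo, hi = mean(c) - pstdev(c), mean(c) + pstdev(c)

        def cost(f):
            return sum((f - x) ** 2 for x in c) - lam * sum((f - x) ** 2 for x in u)

        # A quadratic on an interval attains its minimum at an endpoint or,
        # when it is convex, at its stationary point clamped to the interval.
        candidates = [lo, hi]
        curvature = M - lam * L
        if curvature > 0:
            stationary = (sum(c) - lam * sum(u)) / curvature
            candidates.append(min(max(stationary, lo), hi))
        f_fs.append(min(candidates, key=cost))
    return f_fs


clicked = [[0.8, 0.1], [0.6, 0.3]]   # feature vectors of clicked URLs
unclicked = [[0.1, 0.9]]             # feature vectors of unclicked URLs
doc = pseudo_document(clicked, unclicked)
```

In this toy input the first term is strong in the clicked URLs and weak in the unclicked one, so it ends up with the larger weight in the pseudo-document.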

IV. Fuzzy C Means Algorithm

In fuzzy clustering (also referred to as soft clustering), data elements can belong to more than one
cluster, and associated with each element is a set of membership levels. These indicate the strength of the
association between that data element and a particular cluster. Fuzzy clustering is the process of assigning these
membership levels and then using them to assign data elements to one or more clusters. In many situations,
fuzzy clustering is more natural than hard clustering. Objects on the boundaries between several classes are not
forced to fully belong to one of the classes, but rather are assigned membership degrees between 0 and 1
indicating their partial membership.

The Fuzzy C-Means (FCM) algorithm is used in areas such as computational geometry, data
compression and vector quantization, pattern recognition, and pattern classification. FCM is
an unsupervised clustering algorithm that has been applied to a wide range of problems involving feature analysis,
clustering, and classifier design.

The main features of the algorithm are (i) the use of a fuzzy local similarity measure and (ii) the shielding of
the algorithm from noise-related hypersensitivity. FCM clustering techniques are based on fuzzy behavior, and
they provide a natural way of producing a clustering in which the membership weights have a natural
interpretation, although they are not probabilistic. In fuzzy clustering, every point has a degree of belonging to
clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a
cluster may be in the cluster to a lesser degree than points in the center of the cluster.

Fuzzy clustering techniques such as FCM, which constitute the oldest component of soft computing, are well
suited to handling issues related to the understandability of patterns, incomplete or noisy data, mixed media
information, and human interaction, and they can provide approximate solutions faster.

FCM has a wide domain of applications, such as agricultural engineering, astronomy, chemistry,
geology, image analysis, medical diagnosis, shape analysis, and target recognition. The basic idea of fuzzy
c-means is to find a fuzzy pseudo-partition that minimizes the cost function. Fuzzy c-means has been a very
important tool in image processing for clustering objects in an image. In the 1970s, mathematicians introduced
the spatial term into the FCM algorithm to improve the accuracy of clustering under noise. The fuzzy c-means
algorithm uses the reciprocal of distances to decide the cluster centers.

This algorithm works by assigning a membership to each data point for each cluster center
on the basis of the distance between the cluster center and the data point. The nearer a data point is to a cluster
center, the higher its membership in that cluster. The memberships of each data
point must sum to one. After each iteration, the memberships and cluster centers are updated according to the
update formulas. The FCM algorithm converges to a local minimum of the c-means functional. Hence, different
initializations may lead to different results. The minimization of the c-means functional represents a nonlinear
optimization problem that can be solved by using a variety of methods, including iterative minimization,
simulated annealing, or genetic algorithms.

Fuzzy C-Means (FCM) is a method of clustering which allows one piece of data
to belong to two or more clusters. This method is frequently used in pattern recognition. It is based on
minimization of the following objective function:

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ij}^m \, \|x_i - c_j\|^2, \qquad 1 \le m < \infty,$$

where $m$ is any real number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in the cluster $j$, $x_i$ is the $i$th of the
$d$-dimensional measured data, $c_j$ is the $d$-dimensional center of the cluster, and $\|\cdot\|$ is any norm expressing the
similarity between any measured data and the center.
The time complexity of FCM is $O(ndc^2 i)$, where $n$ is the number of data points, $d$ the number of dimensions, $c$ the
number of clusters, and $i$ the number of iterations.
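
To make the objective function concrete, the following sketch evaluates $J_m$ for a given membership matrix and set of centers. It assumes the Euclidean norm; the function name `fcm_objective` and the toy data are illustrative only.

```python
import math


def fcm_objective(X, centers, U, m=2.0):
    """J_m = sum_i sum_j u_ij^m * ||x_i - c_j||^2 (Euclidean norm assumed)."""
    total = 0.0
    for i, x in enumerate(X):
        for j, c in enumerate(centers):
            total += (U[i][j] ** m) * math.dist(x, c) ** 2
    return total


X = [(0.0, 0.0), (1.0, 0.0)]
centers = [(0.0, 0.0), (1.0, 0.0)]
crisp = [[1.0, 0.0], [0.0, 1.0]]   # each point fully in its own cluster
fuzzy = [[0.5, 0.5], [0.5, 0.5]]   # each point shared equally

j_crisp = fcm_objective(X, centers, crisp)
j_fuzzy = fcm_objective(X, centers, fuzzy)
```

With the centers placed on the two points, the crisp assignment gives a lower cost than the evenly shared one, which is the behavior the minimization rewards.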

V. Clustering Pseudo-Documents Using Fuzzy C Means Algorithm 

Each feedback session is represented by a pseudo-document, and the feature representation of the
pseudo-document is $F_{fs}$. We cluster pseudo-documents by fuzzy c-means clustering, which is simple and effective. Fuzzy
partitioning is carried out through an iterative optimization of the objective function shown above, with the
update of the memberships $u_{ij}$ and the cluster centers $c_j$ by

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{\|x_i - c_j\|}{\|x_i - c_k\|} \right)^{2/(m-1)}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^m \, x_i}{\sum_{i=1}^{N} u_{ij}^m}, \qquad \forall j = 1, 2, \ldots, c.$$

This iteration will stop when

$$\max_{ij} \left| u_{ij}^{(k+1)} - u_{ij}^{(k)} \right| < \varepsilon,$$

where $\varepsilon$ is a termination criterion between 0 and 1, whereas $k$ is the iteration step. This procedure converges to a
local minimum or a saddle point of $J_m$.

FCM clustering is an iterative process. The process stops when the maximum number of iterations
is reached, or when the objective function improvement between two consecutive iterations is less than the
minimum amount of improvement specified.



5.1 Steps

1) Randomly select $c$ cluster centers.

2) Calculate the fuzzy membership $u_{ij}$ using:

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{\|x_i - c_j\|}{\|x_i - c_k\|} \right)^{2/(m-1)}}$$

3) Compute the fuzzy centers $c_j$ using:

$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^m \, x_i}{\sum_{i=1}^{N} u_{ij}^m}, \qquad \forall j = 1, 2, \ldots, c.$$

4) Repeat steps 2) and 3) until the minimum $J$ value is achieved or $\|U^{(k+1)} - U^{(k)}\| < \varepsilon$,

where

$k$ is the iteration step,

$\varepsilon$ is the termination criterion in $[0, 1]$,

$U = (u_{ij})_{N \times c}$ is the fuzzy membership matrix, and

$J$ is the objective function.

FCM is also called Fuzzy ISODATA. FCM employs fuzzy partitioning such that a data point can belong to
all groups with different membership grades between 0 and 1.
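
The four steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions rather than a reference implementation: it uses the Euclidean norm, stops on the largest change in any membership value, and adds two hypothetical conveniences that are not in the text, an optional `init` argument for fixed starting centers and a `seed` for reproducible random selection.

```python
import math
import random


def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=100, init=None, seed=0):
    """Return (centers, U) for data X, where U[i][j] is the membership of x_i in cluster j."""
    n, d = len(X), len(X[0])
    # Step 1: pick c data points at random as initial centers (unless given).
    if init is None:
        centers = [list(x) for x in random.Random(seed).sample(X, c)]
    else:
        centers = [list(x) for x in init]
    U = [[0.0] * c for _ in range(n)]
    power = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Step 2: update memberships from the reciprocal distance ratios.
        new_U = []
        for x in X:
            dist = [math.dist(x, ctr) for ctr in centers]
            zeros = [dj == 0.0 for dj in dist]
            if any(zeros):
                # The point coincides with a center: share membership among those centers.
                row = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
            else:
                row = [1.0 / sum((dj / dk) ** power for dk in dist) for dj in dist]
            new_U.append(row)
        # Step 3: update centers as membership-weighted means.
        centers = []
        for j in range(c):
            w = [new_U[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append([sum(w[i] * X[i][k] for i in range(n)) / total for k in range(d)])
        # Step 4: stop when no membership changed by eps or more.
        change = max(abs(new_U[i][j] - U[i][j]) for i in range(n) for j in range(c))
        U = new_U
        if change < eps:
            break
    return centers, U


X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, U = fuzzy_c_means(X, c=2)
for x, row in zip(X, U):
    print(x, [round(u, 3) for u in row])
```

In the setting of this paper, each row of `X` would be the feature vector $F_{fs}$ of one pseudo-document, and each row of `U` gives its degree of membership in every candidate search goal.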

5.2 Parameters of the FCM algorithm 

Before using the FCM algorithm, the following parameters must be specified: 

■ the number of clusters, c,

■ the fuzziness exponent, m,

■ the termination tolerance, ε,

■ the norm-inducing matrix, A.

There are three types of norm-inducing matrix:

• Euclidean norm 

• diagonal norm 

• Mahalanobis norm 
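
The role of the norm-inducing matrix can be illustrated with the squared distance $d_A^2(x, c) = (x - c)^T A (x - c)$. The sketch below assumes the commonly used choices for $A$: the identity matrix for the Euclidean norm, a diagonal matrix of inverse variances for the diagonal norm, and the inverse covariance matrix for the Mahalanobis norm. The variances and covariance used here are made-up values for the example.

```python
def squared_distance(x, c, A):
    """Squared distance (x - c)^T A (x - c) under the norm-inducing matrix A."""
    diff = [xi - ci for xi, ci in zip(x, c)]
    size = len(diff)
    return sum(diff[r] * A[r][s] * diff[s] for r in range(size) for s in range(size))


x, center = (2.0, 0.0), (0.0, 0.0)

euclidean = [[1.0, 0.0], [0.0, 1.0]]        # A = I
diagonal = [[1.0 / 4.0, 0.0], [0.0, 1.0]]   # inverse of assumed variances 4 and 1
# Inverse of an assumed covariance matrix [[2, 1], [1, 2]].
mahalanobis = [[2.0 / 3.0, -1.0 / 3.0], [-1.0 / 3.0, 2.0 / 3.0]]

d_euclidean = squared_distance(x, center, euclidean)
d_diagonal = squared_distance(x, center, diagonal)
d_mahalanobis = squared_distance(x, center, mahalanobis)
```

The same pair of points is therefore nearer or farther apart depending on the chosen matrix, which in turn changes the memberships that FCM assigns.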

After clustering all the pseudo-documents, each cluster can be considered as one user search goal. 

VI. Evaluating CAP (Classified Average Precision)

CAP (Classified Average Precision) is used to evaluate the performance of user search goal inference
based on restructuring web search results. A possible evaluation criterion is the average precision (AP), which
evaluates according to users' implicit feedback. AP is the average of the precisions computed at the point of each
relevant document in the ranked sequence,

$$AP = \frac{1}{N^+} \sum_{r=1}^{N} rel(r) \, \frac{R_r}{r},$$

where $N^+$ is the number of relevant (or clicked) documents among the retrieved ones, $r$ is the rank, $N$ is the total
number of retrieved documents, $rel()$ is a binary function on the relevance of a given rank, and $R_r$ is the number
of relevant retrieved documents of rank $r$ or less. "Voted AP (VAP)" is the AP of the class that includes more
clicks, namely votes. There should also be a risk term to avoid classifying
search results into too many classes by error. We propose the risk as follows:

$$Risk = \frac{\sum_{i,j=1\,(i<j)}^{m} d_{ij}}{C_m^2}$$



It calculates the normalized number of clicked URL pairs that are not in the same class, where $m$ is the number
of clicked URLs. If the pair of the $i$th clicked URL and the $j$th clicked URL are not categorized into one class,
$d_{ij}$ will be 1; otherwise, it will be 0. $C_m^2 = m(m-1)/2$ is the total number of clicked URL pairs.

We can further extend VAP by introducing the above Risk and propose a new criterion, "Classified
AP," as shown below:

$$CAP = VAP \times (1 - Risk)^{\gamma},$$

where $\gamma$ is used to adjust the influence of Risk on CAP and can be learned from training data.
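
The three quantities in this section can be computed as in the sketch below. The function names and the toy inputs are illustrative assumptions. The value of γ is supplied by the caller because the text only states that it is learned from training data; γ = 1 in the example is an arbitrary choice.

```python
def average_precision(rel):
    """AP for a ranked list; rel[r-1] is 1 if the document at rank r is relevant (clicked)."""
    n_plus = sum(rel)
    if n_plus == 0:
        return 0.0
    hits, total = 0, 0.0
    for r, flag in enumerate(rel, start=1):
        hits += flag                 # R_r: relevant documents at rank r or less
        total += flag * hits / r
    return total / n_plus


def risk(classes):
    """Fraction of clicked-URL pairs assigned to different classes."""
    m = len(classes)
    if m < 2:
        return 0.0
    pairs = m * (m - 1) / 2
    split = sum(1 for i in range(m) for j in range(i + 1, m) if classes[i] != classes[j])
    return split / pairs


def classified_ap(vap, risk_value, gamma):
    """CAP = VAP * (1 - Risk)^gamma."""
    return vap * (1.0 - risk_value) ** gamma


# Clicks at ranks 1 and 3 inside the class that received the most clicks.
vap = average_precision([1, 0, 1, 0, 0])
# Three clicked URLs overall, two in class "A" and one in class "B".
r = risk(["A", "A", "B"])
cap = classified_ap(vap, r, gamma=1.0)
```

In this example VAP is 5/6, two of the three clicked pairs fall into different classes, and CAP drops accordingly, which shows how over-splitting the results is penalized.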

VII. Conclusion 

In this paper, a novel approach has been proposed to restructure user search results for a query by clustering
its feedback sessions, represented by pseudo-documents. Clustering feedback sessions is more efficient than
clustering search results or clicked URLs directly. A new criterion called Classified Average Precision is used to
evaluate the performance of restructured web search results. In this paper, we used fuzzy c-means clustering.
Fuzzy clustering techniques, which constitute the oldest component of soft computing, are well suited to handling
issues related to the understandability of patterns, incomplete or noisy data, mixed media information, and human
interaction, and they can provide approximate solutions faster. Consistent with the time complexity given in
Section IV, the execution time of the FCM clustering algorithm grows with the number of data points and, more
steeply, with the number of clusters. The distances between data points and the shape of the distribution also
affect the performance and behavior of the algorithm. FCM gives good results for overlapping data sets and
performs comparatively better than the k-means algorithm on such data.